Method for classifying a new instance

ABSTRACT

A method for classifying a new instance including a text document by using training instances with known class including labeled data and zero or more training instances with unknown class including unlabeled data, comprising: estimating a word distribution for each class by using the labeled data and the unlabeled data; estimating a background distribution and a degree of interpolation between the background distribution and the word distribution by using the labeled data and the unlabeled data; calculating, for each word of the new instance, a first probability that the word is generated from the word distribution and a second probability that the word is generated from the background distribution; combining the two probabilities by using the degree of interpolation; combining the resulting probabilities of all words to estimate, for each class, a document probability that indicates how likely the document is generated from the class; and classifying the new instance as the class for which the document probability is the highest.

TECHNICAL FIELD

The present invention relates to a classification method that determines the class of a new data instance (e.g. a text document) using a naive Bayes classifier.

BACKGROUND ART

The naive Bayes classifier is still a popular method for classification, especially in text classification, where it often performs on par with the Support Vector Machine (SVM) classifier (see Non-Patent Document 1). One advantage of the naive Bayes classifier is that it has the interpretation of a generative model that can be easily extended to model more complex relations (e.g. see Non-Patent Document 3).

In order to learn a naive Bayes classifier, for each class z, we determine the distribution of words that occur in documents belonging to class z. Let us denote the word distribution for class z as θ_(z), and the probability for a specific word w in class z as θ_(w|z). Often this distribution is modeled by a multinomial distribution. In order to classify a new text, the probability of class z given the new text is calculated by multiplying the prior probability of class z with the probabilities θ_(w|z) for each word w in the new document.

Note that the naive Bayes classifier estimates the probabilities θ_(w|z) using only the training data instances (instances with known class). However, words like “I” or “the”, which occur often in many documents independently of the class, often introduce noise, and in this way the estimates of θ_(w|z) become unreliable. One approach is to use a stop-word list to filter out such words. However, such a stop-word list is static and depends on the domain of the documents. Another approach is to weight the words by their inverse document frequency, as suggested, for example, in Non-Patent Document 1. However, when assigning these weights, the interpretation of the naive Bayes classifier as a generative model is lost. As a consequence, the weights and their interaction with the parameters of the naive Bayes classifier cannot be learned jointly. Therefore, the weights are either fixed, or must be tuned using part of the training data (for example by using cross-validation).

Another line of research tries to improve classification accuracy by additionally using instances (e.g. text documents) for which the class is not known. In contrast to training data instances (instances with known class), such additional instances are often available in large quantities. For example, in contrast to the few newspaper articles that are manually annotated with a class (e.g. whether the article is about “Animals” or about “Computers”), there is a vast amount of newspaper articles for which no such class information is available (unlabeled instances). Such an approach to learning a classifier is often referred to as “semi-supervised”. Non-Patent Document 2 describes such a semi-supervised approach that can improve the estimation of the probabilities θ_(w|z) by using unlabeled instances. Using the Expectation Maximization (EM) algorithm to assign class probabilities to unlabeled instances, they are able to estimate θ_(w|z) for words w that occur in the unlabeled corpus but do not occur in the training data. However, their approach does not provide a solution to the problem of high-frequent words.

DOCUMENTS OF THE PRIOR ART

-   [Non-Patent Document 1] Tackling the poor assumptions of naive Bayes text classifiers, ICML, 2003.
-   [Non-Patent Document 2] Text classification from labeled and unlabeled documents using EM, Machine Learning, 2000.
-   [Non-Patent Document 3] Comparing Bayesian network classifiers, UAI, 1999.

SUMMARY OF INVENTION

Technical Problem

The naive Bayes model is not able to down-weight high-frequent words like “I” or “the” that are often irrelevant for determining the class of a document. However, due to the small sample of training data instances, these irrelevant words might by chance occur more often in one class than in another. As a consequence, for high-frequent words the probabilities θ_(w|z) are not spread evenly over all classes z, and thus some documents are wrongly classified due to the presence of high-frequent words.

Solution to Problem

To overcome the above problem, we propose an extended generative model of the naive Bayes classifier. The extended model introduces a background distribution γ which is set to the frequency distribution of the words in the whole corpus. The whole corpus includes the training data, and can additionally include all other instances for which no class information is available. The proposed model allows any word in the document to be sampled either from the distribution θ_(z) defined by its class z, or from the background distribution γ. As a result, the proposed model allows words, especially high-frequent words, to be explained by the background distribution γ rather than by any class distribution θ_(z). In order to decide whether a word is sampled from the distribution θ_(z) or from the distribution γ, we introduce a binary indicator variable d, one for each word in the document. The prior probability for the variable d controls how likely it is that a word is sampled from γ, and in this way controls the impact of high-frequent words on the classification result. The formulation as a generative model allows us to learn this prior probability efficiently using all instances (labeled and unlabeled), and thus this prior probability does not need to be tuned manually.
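For illustration, the following is a minimal Python sketch of the generative process described above; the toy vocabulary, distributions, and parameter values are hypothetical and serve only to make the sampling of the indicator variable d concrete.

```python
# Minimal sketch of the proposed generative process (hypothetical toy values).
import random

theta = {  # class-conditional word distributions theta_{w|z}
    "Animal":   {"I": 0.2, "the": 0.2, "mouse": 0.2, "cat": 0.3, "keyboard": 0.1},
    "Computer": {"I": 0.2, "the": 0.2, "mouse": 0.3, "cat": 0.1, "keyboard": 0.2},
}
gamma = {"I": 0.3, "the": 0.3, "mouse": 0.15, "cat": 0.1, "keyboard": 0.15}  # background
delta = 0.7                       # prior probability p(d = 1): word drawn from theta_z
p_z = {"Animal": 0.5, "Computer": 0.5}

def sample_document(length):
    """Sample (words, class, d-flags) according to the extended naive Bayes model."""
    z = random.choices(list(p_z), weights=list(p_z.values()))[0]
    words, flags = [], []
    for _ in range(length):
        d = 1 if random.random() < delta else 0          # binary indicator variable d
        dist = theta[z] if d == 1 else gamma             # class distribution or background
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
        flags.append(d)
    return words, z, flags

print(sample_document(5))
```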

Advantageous Effects of Invention

The present invention has the effect of reducing the impact of high-frequent words on the classification result of a naive Bayes classifier. High-frequent words often tend to be less informative than middle- or low-frequent words. The proposed method takes this into account by explaining the high-frequent words by a background distribution (the word frequency distribution of the whole corpus), rather than by the word distribution of any individual class. The proposed method extends the generative model of the naive Bayes classifier, and the additional parameters can be learned from unlabeled data (i.e., there is no need for cross-validation or additional training data).

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram for the naive Bayes model.

FIG. 2 shows the naive Bayes model in plate notation.

FIG. 3 is a block diagram showing an exemplary embodiment of the present invention.

FIG. 4 shows the exemplary embodiment (extension of the naive Bayes model) in plate notation.

DESCRIPTION OF EMBODIMENTS

We demonstrate the proposed idea by extending the naive Bayes classifier for text classification. Given the class z of a document, we assume that each word in the document is independently generated from a distribution θ_(z). A popular choice for this distribution is the categorical distribution (= multinomial distribution for one word occurrence). Using the plate notation, we depict the model in FIG. 2. The block diagram is shown in FIG. 1. Let us denote a document as (w₁, . . . , w_(k)), where w_(j) is the word in the j-th position of the document. Under this model, the joint probability of the document with class z is

${p\left( {w_{1},\ldots \mspace{14mu},w_{k},{z\theta_{z}}} \right)} = {{p(z)} \cdot {\prod\limits_{j = 1}^{k}\theta_{{wj}z}}}$

where θ_(z) is the parameter vector of the categorical distribution, with Σ_(w)θ_(w|z)=1, and p(z) is the prior probability of class z. Accordingly, we have

${p\left( {w_{1},\ldots \mspace{14mu},{w_{k}z},\theta_{z}} \right)} = {\prod\limits_{j = 1}^{k}\theta_{{wj}z}}$

Let us denote by θ the parameter vectors θ_(z) for all classes z. Given a collection of texts D={(t₁,z₁), . . . , (t_(n),z_(n))} with known classes, stored in a non-transitory computer storage medium such as a hard disk drive and a semiconductor memory 1 in FIG. 1, we can estimate the parameters θ_(z) by

$\begin{aligned} \underset{\theta}{\arg\max}\; p(\theta \mid D) &= \underset{\theta}{\arg\max}\; p(\theta) \cdot p(D \mid \theta) \\ &= \underset{\theta}{\arg\max}\; p(\theta) \cdot \prod_{i=1}^{n} p\left( t_{i}, z_{i} \mid \theta \right) \\ &= \underset{\theta}{\arg\max}\; p(\theta) \cdot \prod_{i=1}^{n} p\left( t_{i} \mid z_{i}, \theta \right) \end{aligned}$

using the usual i.i.d. assumption and the fact that z_(i) is independent of θ. Furthermore, using the factorization of the document probability given above, we get, in a block 10 in FIG. 1,

$\underset{\theta}{\arg\max}\; p(\theta \mid D) = \underset{\theta}{\arg\max}\; p(\theta) \cdot \prod_{i=1}^{n} \prod_{j=1}^{k_{i}} \theta_{w_{j}|z_{i}}$

For simplicity, let us assume that p(θ) is constant; then the above expression is maximized by

$\begin{matrix} \theta_{w|z} = \dfrac{\mathrm{freq}_{z}(w)}{\sum_{w'} \mathrm{freq}_{z}\left( w' \right)} & (1) \end{matrix}$

where freq_(z)(w) is the number of times word w occurs in the collection of documents that have class z. The prior probability p(z) can be estimated in a similar way, and is constant if the number of training documents per class is the same for all classes.
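As an illustration of Equation (1), the following Python sketch estimates θ_(w|z) by relative word frequencies per class; the tiny labeled corpus is hypothetical and chosen so that the counts match the worked example below.

```python
# Minimal sketch of Equation (1): theta_{w|z} = freq_z(w) / sum_w' freq_z(w').
from collections import Counter, defaultdict

labeled_docs = [  # hypothetical (words, class) pairs, two words per document
    (["I", "cat"], "Animal"), (["I", "dog"], "Animal"), (["I", "cat"], "Animal"),
    (["mouse", "dog"], "Animal"), (["mouse", "cat"], "Animal"),
    (["I", "keyboard"], "Computer"), (["I", "screen"], "Computer"),
    (["mouse", "screen"], "Computer"), (["mouse", "keyboard"], "Computer"),
    (["mouse", "screen"], "Computer"),
]

def estimate_theta(docs):
    counts = defaultdict(Counter)
    for words, z in docs:
        counts[z].update(words)          # freq_z(w)
    return {z: {w: c / sum(counter.values()) for w, c in counter.items()}
            for z, counter in counts.items()}

theta = estimate_theta(labeled_docs)
print(theta["Animal"]["I"], theta["Computer"]["mouse"])  # 0.3 0.3
```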

For classifying a new document in a block 20′ in FIG. 1, the naive Bayes classifier uses

$\underset{z}{\arg\max}\; p\left( z \mid w_{1},\ldots,w_{k} \right) = \underset{z}{\arg\max}\; p(z) \cdot \prod_{i=1}^{k} \theta_{w_{i}|z}$
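A corresponding classification sketch (reusing the hypothetical estimate_theta() from the previous example) evaluates this decision rule in log space; the small floor for unseen words is an implementation detail, not part of the described method.

```python
# Minimal sketch of the naive Bayes decision rule: argmax_z p(z) * prod_i theta_{w_i|z}.
import math

def classify_naive_bayes(words, theta, p_z):
    best_class, best_score = None, float("-inf")
    for z, dist in theta.items():
        score = math.log(p_z[z])
        for w in words:
            score += math.log(dist.get(w, 1e-12))  # tiny floor for words unseen in class z
        if score > best_score:
            best_class, best_score = z, score
    return best_class

p_z = {"Animal": 0.5, "Computer": 0.5}
print(classify_naive_bayes(["cat", "dog"], theta, p_z))  # "Animal" with the toy corpus
```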

Let us consider a concrete example. Assume that we have two classes, with 5 instances each. For example, 5 documents that are about animals (short: z=A), and 5 documents that are about computers (short: z=C). For simplicity, we assume that each document has two words. Let us assume that the word “I” occurs 3 times, and 2 times, in documents that belong to class “Animal”, and class “Computer”, respectively. Therefore, the probability that the word “I” occurs in a document belonging to class “Animal” is 3/10, and the probability that it occurs in a document belonging to class “Computer” is 2/10. Analogously, assume that the word “mouse” occurs 2 times, and 3 times, in documents that belong to class “Animal”, and class “Computer”, respectively. To summarize, we have the following maximum-likelihood (ML) estimates:

${p\left( {z = A} \right)} = {{p\left( {z = C} \right)} = \frac{5}{10}}$${p\left( {\left. {``I"} \middle| z \right. = A} \right)} = \frac{3}{10}$${p\left( {\left. {``I"} \middle| z \right. = C} \right)} = \frac{2}{10}$${p\left( {\left. {``{mouse}"} \middle| z \right. = A} \right)} = \frac{2}{10}$${p\left( {\left. {``{mouse}"} \middle| z \right. = C} \right)} = \frac{3}{10}$

Let us now consider a new document that contains the two words “I” and “mouse”. The class for the new document is decided by considering the ratio

$\frac{p\left( {{z = \left. A \middle| {``I"} \right.},{``{mouse}"}} \right)}{p\left( {{z = \left. C \middle| {``I"} \right.},{``{mouse}"}} \right)}$

If this ratio is larger than 1, then the document is classified as “Animal”; if it is smaller than 1, it is classified as “Computer”. Using the naive Bayes classifier, this can be written as follows

$\dfrac{p(z = A \mid \text{“I”}, \text{“mouse”})}{p(z = C \mid \text{“I”}, \text{“mouse”})} = \dfrac{p(z = A) \cdot p(\text{“I”} \mid z = A) \cdot p(\text{“mouse”} \mid z = A)}{p(z = C) \cdot p(\text{“I”} \mid z = C) \cdot p(\text{“mouse”} \mid z = C)} = 1.0$

Therefore, we see that the naive Bayes classifier is not able to distinguish between the two classes. However, in general, by inspecting a large collection of documents, we know that the word “I” is a high-frequent word that is not very informative, that is, the word is of little help for document classification. In contrast, the word “mouse” is a more specific word and can in general better help to distinguish between two classes. As a consequence, the word “mouse” should have (slightly) more weight for deciding the class, and therefore the document with the words “I” and “mouse” should be classified as “Computer” rather than “Animal”.
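A quick numeric check of this example (hypothetical helper code using the ML estimates above) confirms that the plain naive Bayes ratio is exactly 1.0:

```python
# The plain naive Bayes ratio for the document ("I", "mouse") from the example above.
ratio = (0.5 * 3/10 * 2/10) / (0.5 * 2/10 * 3/10)
print(ratio)  # 1.0 -> the classifier cannot decide between "Animal" and "Computer"
```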

We now describe our extension of the naive Bayes model, displayed in FIG. 4 and FIG. 3.

FIG. 3 shows a system, usually implemented by a computer system, for classifying a new data instance, such as a text document. In FIG. 3, a word distribution learning block 10 learns word distributions for each class using training data (a collection of documents with assigned classes) stored in a non-transitory computer storage medium 1 such as a hard disk drive and a semiconductor memory, similarly to the conventional system. Further, in this exemplary embodiment, a background distribution and interpolation parameter learning block 15 learns the background distribution and interpolation parameter using a corpus (collection of documents) stored in a computer storage medium 2 such as a hard disk drive and a semiconductor memory. A classification block 20 is provided with a new document as a new instance, and classifies the document using the word distributions for each class, interpolated with the background distribution, provided from the blocks 10 and 15, respectively. The classification block 20 then outputs the most likely class of the input document as the classification result.

More specifically, under the proposed model, the joint probability of the text document with words w₁, . . . , w_(k), hidden variables d₁, . . . , d_(k), and class z is

${p\left( {w_{1},\ldots \mspace{14mu},w_{k},d_{1},\ldots \mspace{14mu},d_{k},z} \right)} = {{p(z)} \cdot {\prod\limits_{j = 1}^{k}{{p\left( {\left. w_{j} \middle| z \right.,d_{j}} \right)} \cdot {p\left( d_{j} \right)}}}}$

where the word probability p(w|z, d) is defined as follows:

${p\left( {\left. w \middle| z \right.,d} \right)} = \left\{ \begin{matrix}\theta_{w|z} & {{{if}\mspace{14mu} d} = 1} \\\gamma_{w} & {{{if}\mspace{14mu} d} = 0}\end{matrix} \right.$

The variables d_(j) are binary random variables that indicate whether the word w_(j) is drawn from the class's word distribution θ_(z) or from the background distribution γ. The variables d_(j) are hidden variables which cannot be observed from the training documents. To acquire the probability of a training document (w₁, . . . , w_(k), z), we sum over all d₁, . . . , d_(k), leading to

$\begin{matrix}{{p\left( {w_{1},\ldots \mspace{14mu},w_{k},z} \right)} = {\sum\limits_{d_{1},\ldots \mspace{14mu},d_{k}}^{\;}{{p(z)} \cdot {\prod\limits_{j = 1}^{k}{{p\left( {\left. w_{j} \middle| z \right.,d_{j}} \right)} \cdot {p\left( d_{j} \right)}}}}}} \\{= {{p(z)} \cdot {\prod\limits_{j = 1}^{k}{\sum\limits_{d_{j}}^{\;}{{p\left( {\left. w_{j} \middle| z \right.,d_{j}} \right)} \cdot {p\left( d_{j} \right)}}}}}}\end{matrix}$

We assume that the prior probability p(d_(j)) is independent of the class of the document, and independent of the word position j. Therefore, we define δ:=p(d_(j)=1), which is constant for all words. In this way, the joint probability of the document with class z can be expressed as follows

$\begin{matrix}{{p\left( {w_{1},\ldots \mspace{14mu},w_{k},z} \right)} = {{p(z)} \cdot {\prod\limits_{j = 1}^{k}\left( {{{p\left( {d_{j} = 0} \right)} \cdot \gamma_{w_{j}}} + {{p\left( {d_{j} = 1} \right)} \cdot \theta_{w_{j}|z}}} \right)}}} \\{= {{p(z)} \cdot {\prod\limits_{j = 1}^{k}\left( {{\left( {1 - \delta} \right) \cdot \gamma_{w_{j}}} + {\delta \cdot \theta_{w_{j}|z}}} \right)}}}\end{matrix}$

For a class z, the word distribution θ_(w|z) can be estimated as before using Equation (1). For estimating the background distribution γ and the prior probability δ in a block 15 in FIG. 3, we additionally use a collection of text documents for which the class is not known, stored in a non-transitory computer storage medium 2 in FIG. 3. Such text documents are often available in large quantities. For example, for spam detection, we might have a few hundred documents for which the label “spam” or “ham” is manually annotated, but thousands of emails that were not labeled. Let D*={t₁, . . . , t_(n*)} be the collection of all documents. (It also includes the documents for which a class label is available. Alternatively, D* is the collection of only the documents for which no class information is available.) We estimate γ_(w) by using the word distribution in D*, that is

$\begin{matrix}{\gamma_{w} = \frac{{freq}_{D^{*}}(w)}{\sum\limits_{w^{\prime}}^{\;}{{freq}_{D^{*}}\left( w^{\prime} \right)}}} & (2)\end{matrix}$

where freq_(D*)(w) is the frequency of word w in D*. (For example, if D* contains two documents, where in the first document word w occurs 3 times, and in the second document it occurs 2 times, then freq_(D*)(w) equals 5.)
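A sketch of Equation (2) in Python, using a hypothetical two-document corpus as in the parenthetical example above:

```python
# Minimal sketch of Equation (2): gamma_w = freq_{D*}(w) / total word count of D*.
from collections import Counter

def estimate_gamma(all_docs):
    counts = Counter()
    for words in all_docs:
        counts.update(words)                 # freq_{D*}(w)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

corpus = [["I", "like", "my", "cat"], ["I", "bought", "a", "mouse"]]
print(estimate_gamma(corpus)["I"])  # 2/8 = 0.25
```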

The parameter δ can also be learned using D*, as we show later, or can be set manually to a value in ]0, 1]. Note that if δ is 1, the model reduces to the original naive Bayes classifier.

Finally, in order to classify a new document w₁, . . . , w_(k) in a block 20 in FIG. 3, we use

${p\left( {\left. z \middle| w_{1} \right.,\ldots \mspace{11mu},w_{k}} \right)} \propto {{p(z)} \cdot {\prod\limits_{i = 1}^{k}\; \left( {{{p\left( {d_{i} = 0} \right)} \cdot \gamma_{w_{i}}} + {{p\left( {d_{i} = 1} \right)} \cdot \theta_{w_{i}|z}}} \right)}}$

To see that the proposed method can diminish the impact of high-frequent words, consider the same example as before. Let us assume that we additionally have 90 documents without class information (unlabeled corpus) in the non-transitory computer storage medium 2 in FIG. 3. We assume that the word “I” occurs 20 times, and the word “mouse” occurs 10 times, in the unlabeled corpus. These 90 documents (unlabeled corpus), together with the 10 documents (labeled corpus) for which the class is known, form the complete corpus. For a word w, the probability γ_(w) is estimated as follows:

$\gamma_{w} = \dfrac{\text{number of times word } w \text{ occurs in the complete corpus}}{\text{total number of words in the complete corpus}}$

and therefore we have

$\gamma_{I} = \frac{20 + 5}{100 \cdot 2} = 0.125$ and $\gamma_{mouse} = \frac{10 + 5}{100 \cdot 2} = 0.075$

The class probabilities θ_(w|z) for the words “I” and “mouse” are set to the probabilities p(w|z) of the original naive Bayes model, i.e.:

$\theta_{I|A} = \frac{3}{10}, \quad \theta_{I|C} = \frac{2}{10}, \quad \theta_{mouse|A} = \frac{2}{10}, \quad \theta_{mouse|C} = \frac{3}{10}$

Furthermore, for simplicity we assume that δ is set to 0.5, that means p(d_(j)=1)=p(d_(j)=0)=0.5 for all j. With this choice, each factor (1−δ)·γ_(w)+δ·θ_(w|z) equals 0.5·(γ_(w)+θ_(w|z)), and the constant 0.5 cancels in the ratio below. Let us now consider the document containing the two words “I” and “mouse”, whose class is decided by the following ratio

$\dfrac{p(z = A \mid \text{“I”}, \text{“mouse”})}{p(z = C \mid \text{“I”}, \text{“mouse”})} = \dfrac{\left( \theta_{I|A} + \gamma_{I} \right) \cdot \left( \theta_{mouse|A} + \gamma_{mouse} \right)}{\left( \theta_{I|C} + \gamma_{I} \right) \cdot \left( \theta_{mouse|C} + \gamma_{mouse} \right)} = \dfrac{0.425 \cdot 0.275}{0.325 \cdot 0.375} \approx 0.96 < 1$

Therefore, the document is classified as a “Computer” article, which is in contrast to the result obtained before with the (original) naive Bayes classifier. We can see that here the weight of the word “mouse” dominates the weight of the word “I”, which is a high-frequent word. In general, high-frequent words get a lower weight for deciding the class, and therefore their (negative) impact is diminished.
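The arithmetic of this extended-model example can be checked with a few lines of hypothetical code (the factors 0.5 arising from δ cancel in the ratio and are therefore omitted):

```python
# Numeric check of the extended-model example: ratio ~= 0.96 < 1 -> "Computer".
gamma = {"I": 25 / 200, "mouse": 15 / 200}          # 0.125 and 0.075
theta = {"A": {"I": 0.3, "mouse": 0.2}, "C": {"I": 0.2, "mouse": 0.3}}

num = (theta["A"]["I"] + gamma["I"]) * (theta["A"]["mouse"] + gamma["mouse"])
den = (theta["C"]["I"] + gamma["I"]) * (theta["C"]["mouse"] + gamma["mouse"])
print(num / den)  # ~0.959, so the document with "I" and "mouse" is classified as "Computer"
```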

We note that the above example holds more generally. Consider a document that has two words a and b, with θ_(a|z1)=θ_(b|z2) and θ_(a|z2)=θ_(b|z1). Without loss of generality, we assume that θ_(a|z1)>θ_(a|z2). In words, this means that word a suggests class z₁ exactly as strongly as word b suggests class z₂. Furthermore, let δ be in ]0,1[. Assuming that the prior probabilities of classes z₁ and z₂ are the same, we can see whether the document a, b is classified as class z₁ or z₂ by inspecting the ratio:

$\frac{{\left( {1 - \delta} \right) \cdot \gamma_{a}} + {\delta \cdot \theta_{a|z_{1}}}}{{\left( {1 - \delta} \right) \cdot \gamma_{a}} + {\delta \cdot \theta_{a|z_{2}}}} \cdot \frac{{\left( {1 - \delta} \right) \cdot \gamma_{b}} + {\delta \cdot \theta_{b|z_{1}}}}{{\left( {1 - \delta} \right) \cdot \gamma_{b}} + {\delta \cdot \theta_{b|z_{2}}}}$

If the ratio is larger than 1, the document is classified as class z₁; if the ratio is smaller than 1, the document is classified as class z₂. We can show that this ratio is smaller than 1 if, and only if, γ_(a)>γ_(b). Therefore, if the word b is less frequent than a, the weight of word b becomes higher than the weight of word a.

As a consequence, the proposed method can have a similar effect as idf-weighting, in the sense that it mitigates high-frequency words. Note that the original naive Bayes classifier cannot classify such a document, because in that case we would be directly on the decision boundary.

Proof of the Above Statement:

To simplify notation let

γ′_(a) := (1−δ)·γ_(a)

γ′_(b) := (1−δ)·γ_(b)

θ′_(a|z₁) := δ·θ_(a|z₁)

θ′_(b|z₁) := δ·θ_(b|z₁)

θ′_(a|z₂) := δ·θ_(a|z₂)

θ′_(b|z₂) := δ·θ_(b|z₂)

Since θ_(a|z1)=θ_(b|z2), θ_(a|z2)=θ_(b|z1), and θ_(a|z1)>θ_(a|z2), we then have

$\begin{aligned} & \frac{\gamma'_{a} + \theta'_{a|z_{1}}}{\gamma'_{a} + \theta'_{a|z_{2}}} \cdot \frac{\gamma'_{b} + \theta'_{b|z_{1}}}{\gamma'_{b} + \theta'_{b|z_{2}}} < 1 \\ \Leftrightarrow\; & \frac{\gamma'_{a} + \theta'_{a|z_{1}}}{\gamma'_{a} + \theta'_{a|z_{2}}} \cdot \frac{\gamma'_{b} + \theta'_{a|z_{2}}}{\gamma'_{b} + \theta'_{a|z_{1}}} < 1 \\ \Leftrightarrow\; & \left( \gamma'_{a} + \theta'_{a|z_{1}} \right) \cdot \left( \gamma'_{b} + \theta'_{a|z_{2}} \right) < \left( \gamma'_{a} + \theta'_{a|z_{2}} \right) \cdot \left( \gamma'_{b} + \theta'_{a|z_{1}} \right) \\ \Leftrightarrow\; & \gamma'_{a} \cdot \theta'_{a|z_{2}} + \gamma'_{b} \cdot \theta'_{a|z_{1}} < \gamma'_{a} \cdot \theta'_{a|z_{1}} + \gamma'_{b} \cdot \theta'_{a|z_{2}} \\ \Leftrightarrow\; & \gamma'_{a} \cdot \left( \theta'_{a|z_{2}} - \theta'_{a|z_{1}} \right) < \gamma'_{b} \cdot \left( \theta'_{a|z_{2}} - \theta'_{a|z_{1}} \right) \\ \Leftrightarrow\; & \gamma'_{a} > \gamma'_{b} \\ \Leftrightarrow\; & \gamma_{a} > \gamma_{b} \end{aligned}$

where the second line substitutes θ′_(b|z₁)=θ′_(a|z₂) and θ′_(b|z₂)=θ′_(a|z₁), and the second-to-last equivalence holds because θ′_(a|z₂)−θ′_(a|z₁)<0, so dividing both sides by this factor reverses the inequality.

It is not difficult to see that the parameter δ controls how much the impact of high-frequent words is reduced. We will now show that the parameter can be learned from the corpus D*. We suggest setting δ such that, if there are many high-frequent words in D* that cannot be explained by any θ_(z), the parameter δ is closer to 0. We can achieve this by choosing the parameter δ* that maximizes p(D*) under our proposed model for fixed parameters θ_(z) and γ.

This means

$\begin{matrix} \delta^{*} := \underset{\delta}{\arg\max}\; p\left( D^{*} \right) = \underset{\delta}{\arg\max}\; \prod_{i=1}^{n^{*}} \sum_{z_{i}} p\left( z_{i} \right) \cdot \prod_{j=1}^{k_{i}} \left( \left( 1 - \delta \right) \cdot \gamma_{w_{j}} + \delta \cdot \theta_{w_{j}|z_{i}} \right) & (3) \end{matrix}$

To find an approximate solution to this problem we can, for example, use the EM algorithm, considering all class labels z_(i) and all indicator variables d_(j) as unobserved.
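As an illustration, a minimal EM sketch for learning δ according to Equation (3) could look as follows; θ, γ, and p(z) are held fixed, the class labels z_(i) and indicators d_(j) are treated as unobserved, and all names are hypothetical rather than a definitive implementation.

```python
# Minimal EM sketch for delta* of Equation (3), with theta, gamma, p(z) held fixed.
def learn_delta(docs, theta, gamma, p_z, n_iter=50, delta=0.5):
    for _ in range(n_iter):
        expected_d1, total_words = 0.0, 0
        for words in docs:
            # E-step (document level): posterior over the unobserved class z_i
            doc_probs = {}
            for z in theta:
                prob = p_z[z]
                for w in words:
                    prob *= (1 - delta) * gamma.get(w, 1e-12) + delta * theta[z].get(w, 1e-12)
                doc_probs[z] = prob
            norm = sum(doc_probs.values())
            # E-step (word level): posterior that d_j = 1, averaged over z_i
            for w in words:
                for z, pz_doc in doc_probs.items():
                    from_theta = delta * theta[z].get(w, 1e-12)
                    from_gamma = (1 - delta) * gamma.get(w, 1e-12)
                    expected_d1 += (pz_doc / norm) * from_theta / (from_theta + from_gamma)
            total_words += len(words)
        # M-step: delta = expected fraction of words drawn from the class distribution
        delta = expected_d1 / total_words
    return delta
```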

We note that, in the same way as δ, it is also possible to estimate γ instead of setting it to the word frequency distribution (as in Equation (2)). In doing so, for high-frequent words w that can be well explained by a class z, i.e. for which θ_(w|z) is high, the probability γ_(w) is reduced. This has the advantage that such high-frequent words w retain a high weight that favors class z.

For simplicity, in this example we set the probability θ_(w|z) and the probability γ_(w) to the categorical distribution (or multinomial distribution without the combinatorial factor for the word frequency). However, in practice, for modelling text it is advantageous to use instead a mixture distribution model, most notably a mixture of multinomial distributions as in Non-Patent Document 2. The number of components can be determined using cross-validation, and the word probabilities for each component can be learned, for example, using the EM algorithm from labeled and unlabeled data. It is also possible to assume an infinite mixture model, by placing a Dirichlet process prior over the number of components. In that case, the probabilities θ_(w|z) and the probability γ_(w) can be estimated using Markov chain Monte Carlo (MCMC) methods.

As an alternative to Equation (3), we can set the interpolation parameter δ such that the expected document classification accuracy is optimized. This can be achieved by using cross-validation on the training data instances with class information (i.e. labeled data).
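A cross-validation sketch for this alternative is given below; it reuses the hypothetical estimate_theta() and document_probability() sketches from above, and the fold splitting is deliberately simplistic.

```python
# Minimal sketch: pick delta by cross-validated classification accuracy on labeled data.
def select_delta_by_cv(labeled_docs, gamma, p_z, candidate_deltas, n_folds=5):
    best_delta, best_accuracy = None, -1.0
    for delta in candidate_deltas:
        correct, total = 0, 0
        for fold in range(n_folds):
            train = [d for i, d in enumerate(labeled_docs) if i % n_folds != fold]
            test = [d for i, d in enumerate(labeled_docs) if i % n_folds == fold]
            theta = estimate_theta(train)
            for words, z_true in test:
                scores = {z: document_probability(words, z, theta, gamma, p_z, delta)
                          for z in theta}
                correct += int(max(scores, key=scores.get) == z_true)
                total += 1
        if correct / total > best_accuracy:
            best_delta, best_accuracy = delta, correct / total
    return best_delta

# Example call: select_delta_by_cv(labeled_docs, gamma, p_z, [0.25, 0.5, 0.75, 1.0])
```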

The method for classifying a new data instance, such as a text document, of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other computation and processing device. On the other hand, the functions may be realized by execution of a program used to realize the steps of the method for classifying a new data instance, such as a text document.

Moreover, a program to realize the steps of the method for classifying a new data instance, such as a text document, may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform the processing of the method for classifying a new data instance, such as a text document. Here, a “computer system” may include an OS, peripheral equipment, or other hardware.

Further, “computer-readable storage media” means a flexible disk, magneto-optical disc, ROM, flash memory or other writable nonvolatile memory, CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.

Further, “computer-readable storage media” also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.

For convenience, we use the term “word” to describe a feature in the present specification and the claims below. However, we note that the method can also be applied to other features that are not lexical.

INDUSTRIAL APPLICABILITY

The present invention makes it possible to classify an input text with a naive Bayes classifier without prior feature selection that removes uninformative high-frequent words (like stop-words). Feature selection is known to improve the performance of a classifier, since it removes noise. However, feature selection needs to be done partly manually, involving additional costs. The present invention automatically determines how to diminish the noise from high-frequent words by learning word distributions from unlabeled text. That means that no parameters need to be manually tuned, and no additional manually labeled training data is necessary. The present invention is formulated as an extension of the generative process of the naive Bayes classifier, which allows it to be easily extended to model more complex interactions of words, or to model words together with other types of attributes (e.g. for spam detection, the actual email text plus additional attributes such as the number of times an email from the same sender was removed). As a consequence, the present invention allows high text classification accuracy without additional costs.

CLAIMS

1. A method for classifying a new instance including a text document by using a collection of training instances with known class including labeled data and zero or more training instances with unknown class including unlabeled data, comprising: estimating a word distribution for each class by using the labeled data and the unlabeled data; estimating a background distribution, and a degree of interpolation between the background distribution and the word distribution, by using the labeled data and the unlabeled data; calculating, for each word of the new instance, a first probability that the word is generated from the word distribution and a second probability that the word is generated from the background distribution; combining the first probability with the second probability by using the degree of interpolation; combining the resulting probabilities of all words to estimate a document probability, for the class, that indicates that the document is generated from the class; and classifying the new instance as a class for which the document probability is the highest.

2. The method according to claim 1, wherein, in estimating the background distribution, the background distribution is estimated such that a probability of observing the collection of all of the training instances with the known class and the unknown class is maximized.

3. The method according to claim 1, wherein the background distribution is set to a word frequency distribution as observed in all instances with the known class and the unknown class.

4. The method according to claim 1, wherein the interpolation parameter is set such that an expected document classification accuracy is optimized.

5. The method according to claim 1, wherein the word distribution for each class and/or the background distribution is set to a multinomial distribution or a mixture of multinomial distributions, and is estimated by using the labeled data or by using both the labeled data and the unlabeled data.